# install.packages("tidyverse")
library(ggplot2)
# library(tidyverse)
R Visualization
Visualization
Visualization using ggplot2 package
Although R has some built-in functions for visualization, the ggplot2 package is a go-to for visualization needs. ggplot2 is part of a group of packages called the tidyverse. Since we will use other packages within the tidyverse, we can install and load the tidyverse library to use ggplot2. Alternatively, you can install just ggplot2 and load the ggplot2 package into your environment.
ggplot2 Cheatsheet: https://rstudio.github.io/cheatsheets/data-visualization.pdf
Note: This tutorial builds from R for Data Science (https://r4ds.hadley.nz/data-visualize). See https://ggplot2.tidyverse.org/ for other great resources.
diamonds Data Frame
We will be working with a data frame that is included in the ggplot2 package.
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). diamonds contains the prices and other attributes of almost 54,000 diamonds.
Take a look at the data.
head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
We can also load the data into the RStudio global environment.
data(diamonds)
Or get more information using ?diamonds.
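Before plotting, it can help to check the size and structure of the data. The following is a minimal sketch using only base R functions on the diamonds data frame that ships with ggplot2; which columns to inspect is an illustrative choice, not part of the original tutorial.

```r
library(ggplot2)

# Number of rows (observations) and columns (variables)
dim(diamonds)

# Compact overview of every column and its type
str(diamonds)

# cut is an ordered factor; list its levels from lowest to highest
levels(diamonds$cut)

# Five-number summary (plus mean) of diamond weight
summary(diamonds$carat)
```

Ordered factors such as cut, color, and clarity keep their level order when they are used on an axis or in a legend.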
Histogram
A histogram shows the distribution of diamond weights (carats) in “bins.” Each bin covers a range of values, and the height of its bar shows how many observations fall in that range.
Base R Histogram
# base R visualization
hist(diamonds$carat)
ggplot Histogram (default settings)
# ggplot visualization with default settings
ggplot(data = diamonds, aes(carat)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
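The `stat_bin()` message suggests choosing a bin width explicitly instead of relying on the default of 30 bins. Here is a minimal sketch, assuming a width of 0.25 carats is a reasonable choice for this data; the width, the `boundary = 0` alignment, and the object name `p_bw` are illustrative assumptions.

```r
library(ggplot2)

# Set the width of each bin directly (in carats) rather than the bin count.
# boundary = 0 aligns bin edges to multiples of the bin width.
p_bw <- ggplot(data = diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.25,
                 boundary = 0,
                 fill = 'lightblue',
                 color = 'black')

p_bw
```

Specifying `binwidth` also silences the message, because ggplot2 no longer has to guess a binning.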
ggplot: add specifications
# ggplot visualization with specifying bins, fill, and color
ggplot(data = diamonds, aes(carat)) +
  geom_histogram(bins = 12,
                 fill = 'lightblue',
                 color = 'black')
ggplot: add a layer for labels and theme
ggplot(data = diamonds, aes(carat)) +
  geom_histogram(bins = 12,
                 fill = 'lightblue',
                 color = 'black') +
  labs(title = 'Histogram of Carat',
       x = 'Carat',
       y = 'Frequency'
  ) +
  theme_minimal()
ggplot: store the plot as an object
p <- ggplot(diamonds, aes(carat)) +
  geom_histogram(bins = 12,
                 fill = 'lightblue',
                 color = 'black') +
  labs(title = 'Histogram of Carat',
       x = 'Carat',
       y = 'Frequency'
  ) +
  theme_minimal()
ggplot: add interactivity with plotly
plotly is another library that allows interactivity with ggplot visuals by converting ggplot objects into plotly objects. You may need to install the plotly package before loading the library.
# install.packages('plotly')
library(plotly)
ggplotly(p)
Or, without loading the library, we can access the ggplotly function using ::. This works for any installed package in R.
plotly::ggplotly(p)
Density Plot
A density plot is a smoothed version of a histogram: it estimates how likely observations are to fall near each value of the variable of interest (in this case “carat”). Density plots are typically used to show data over a continuous interval or time period.
Let’s fill the plot with a color using the fill argument. Notice this is specified outside of aes. This is because we aren’t mapping the fill color to any specific variable - rather we want the entire graph to be the color we specified. What happens if we accidentally include it in the aes argument?
ggplot(data = diamonds, aes(x = carat)) +
  geom_density(fill = 'lightblue')
What does this look like using the color argument instead?
ggplot(data = diamonds, aes(x = carat)) +
  geom_density(color = 'lightblue')
fill is generally used for coloring enclosed areas while color is typically used to set the color of the outline or of elements that don’t have fillable areas like lines or points.
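To answer the earlier question about putting the fill inside aes, here is a minimal sketch comparing the two forms. The object names `p_const` and `p_mapped` are made up for illustration; `layer_data()` is the ggplot2 helper that returns the data a layer actually draws.

```r
library(ggplot2)

# Fill set as a constant, outside aes(): the area is drawn in light blue
p_const <- ggplot(data = diamonds, aes(x = carat)) +
  geom_density(fill = 'lightblue')

# Fill placed inside aes(): the string is treated as a one-level variable,
# so ggplot2 assigns a color from its default scale and adds a legend
p_mapped <- ggplot(data = diamonds, aes(x = carat, fill = 'lightblue')) +
  geom_density()

# Inspect the fill values each layer ends up using
unique(layer_data(p_const)$fill)
unique(layer_data(p_mapped)$fill)
```

In the mapped version the word 'lightblue' only appears as a legend label; the actual color comes from the fill scale.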
Scatterplot
Now let’s look at the distribution of price across different diamond weights (“carats”).
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()
Exercise
Create a scatterplot of price and carat where each point is colored purple.
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(color = 'purple')
While this plot is interesting, it would be useful to add another dimension. Let’s look at the distribution of “clarity” across this scatterplot.
Notice that “clarity” is included inside the aes argument. This is because the color is determined by the data.
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point()
Faceted plots
These plots are “small multiples” of the data. Rather than showing every clarity level in a single plot of carat against price, we can view a separate panel for each level at once.
facet_wrap
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  facet_wrap(~clarity)
facet_grid
ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point() +
  facet_grid(clarity ~ cut)
Boxplots
A boxplot is another good way to look at the distribution of data. A one-dimensional boxplot uses only one variable, y, although the code here also supplies a constant number for x.
ggplot(diamonds, aes(y = carat, x = 1)) +
  geom_boxplot()
We can create multiple boxplots across a categorical variable as well.
ggplot(diamonds, aes(y = carat, x = cut)) +
  geom_boxplot()
Bar Charts
Counts
Bar charts use the geom geom_bar, which in its most basic form plots a count of observations in the data. When no y variable is present, it automatically generates the bars by counting the rows for each value of x. This is considered a stat (or statistical transformation), and count is geom_bar’s default stat.
ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()
# geom_bar(stat = 'count')
Horizontal Orientation
Method 1: Map to the y variable instead of x.
ggplot(data = diamonds, mapping = aes(y = cut)) +
  geom_bar()
Method 2: “Coord Flip”
ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar() +
  coord_flip()
Stacked bar chart
Adding a fill to the bar chart can be useful to detect patterns. See the stacked bar chart below.
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar()
Stacked bar chart (percentage of total)
While colorful, it is difficult to see proportions. Adding the argument position = "fill" stacks each bar to the same height.
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill")
Dodged bar chart
position = "dodge" is helpful for comparing groups side by side.
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")
Proportions
The count stat also computes other values that can be referenced, like “prop”, which shows proportions. Note that an additional argument for group must also be added.
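To see why the group argument matters, here is a minimal sketch that compares the computed proportions with and without `group = 1`. The object names `p_no_group` and `p_grouped` are made up for illustration.

```r
library(ggplot2)

# Without group = 1, each cut is its own group, so every proportion is
# computed within a single bar and comes out as 1
p_no_group <- ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop))) +
  geom_bar()

# With group = 1, all rows form one group, so the proportions sum to 1
p_grouped <- ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

# Inspect the prop values computed for each bar
layer_data(p_no_group)$prop
layer_data(p_grouped)$prop
```

Setting `group = 1` puts all rows in the same group, which is what makes each bar a share of the whole data set.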
ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()
We can modify the y axis to show percentages by adding scale_y_continuous(labels = scales::percent).
ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar() +
  scale_y_continuous(labels = scales::percent)
Set bar length equal to a variable
If we want to use one of our data variables as the y axis, we add the argument stat = 'identity'. This tells R that the statistical transformation is to take the actual values of whatever y argument is defined in the aesthetic mapping.
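As an aside, ggplot2 also provides geom_col(), which draws bars from the y values as given and so behaves like geom_bar(stat = 'identity'). Here is a minimal sketch on a small made-up table; the data frame `toy` and its values are hypothetical and exist only to show the equivalence.

```r
library(ggplot2)

# Hypothetical summary table, invented for illustration only
toy <- data.frame(group = c('a', 'b', 'c'),
                  value = c(3, 5, 2))

# Bars whose heights are taken directly from the value column
p_bar <- ggplot(data = toy, mapping = aes(x = group, y = value)) +
  geom_bar(stat = 'identity')

# The same chart written with geom_col()
p_col <- ggplot(data = toy, mapping = aes(x = group, y = value)) +
  geom_col()

# Both layers compute the same bar heights
identical(layer_data(p_bar)$y, layer_data(p_col)$y)
```

Either form works; this tutorial continues with geom_bar(stat = 'identity').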
Let’s find the average price for each cut.
library(dplyr)
avg_price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(avg_price = mean(price))
p <- ggplot(data = avg_price_by_cut, mapping = aes(x = cut, y = avg_price)) +
  geom_bar(stat = 'identity')
# add labels to the bars
p + geom_text(aes(label = avg_price))
# add rounding to the label
p + geom_text(aes(label = round(avg_price, 0)))
# use the scales package to format with comma
p + geom_text(aes(label = scales::comma(avg_price)))
Exercise
Create a bar graph that shows median price by clarity.
diamonds %>%
  group_by(clarity) %>%
  summarise(median_price = median(price)) %>%
  ggplot(aes(clarity, median_price)) +
  geom_bar(stat = 'identity')
Exercise with mpg data
Let’s look at another dataset included in the ggplot2 package called mpg.
Look at the description of the data with ?mpg. Load mpg into your environment to take a look.
data(mpg)
1) Create a scatterplot using the variables x = displ and y = hwy and a color aesthetic of another variable like class.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer()
2) How would you color all points the same color?
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = '#009933')
More ggplot2 practice: https://posit.cloud/learn/recipes